ANOVA

5 minute read


This is a blog reviewing medical statistics.

Extending \(\chi^2\), Student’s \(t\), and \(F\) Distributions

Given a sample \(X_1, \dots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)\), we already know how to construct the three classical statistics (\(\chi^2\), \(t\), and \(F\)) for hypothesis tests such as \(H_0: \mu = \mu_0\) or \(H_0: \sigma^2 = \sigma_0^2\). But what if we encounter a multivariate normal distribution?

\[\boldsymbol{X} \sim N(\boldsymbol{\mu}_p, \Sigma_p)\]

Can we extend the testing statistics to this case? Yes! Assume we have observations of the form:

\[\mathrm{X} = \{\boldsymbol{X}_1^\top, \dots, \boldsymbol{X}_n^\top\}^\top \in \mathbb{R}^{n \times p}\]

Wishart Distribution

Definition
If \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n \overset{\text{iid}}{\sim} N(\boldsymbol{0}_p, \Sigma_p)\), then: \(\sum_{i=1}^n \boldsymbol{X}_i \boldsymbol{X}_i^\top \sim W(\Sigma_p, n)\)

Properties:

  • Additivity: If \(W_1 \sim W(\Sigma_p, n_1)\) and \(W_2 \sim W(\Sigma_p, n_2)\) are independent, then: \(W_1 + W_2 \sim W(\Sigma_p, n_1 + n_2)\)
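
To make this concrete, here is a minimal numpy sketch (not from the original post; the covariance matrix and sample sizes are arbitrary choices) that builds Wishart draws as sums of outer products and sanity-checks additivity via the fact that \(E[W(\Sigma_p, n)] = n \Sigma_p\):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

def wishart_sample(n, Sigma, rng):
    """Draw from W(Sigma, n) as sum_i X_i X_i^T with X_i ~ N(0, Sigma)."""
    X = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n)  # shape (n, p)
    return X.T @ X

n1, n2 = 10, 15
W1 = wishart_sample(n1, Sigma, rng)
W2 = wishart_sample(n2, Sigma, rng)
# Additivity: W1 + W2 is a single draw from W(Sigma, n1 + n2).

# Sanity check via E[W(Sigma, n)] = n * Sigma: average many replications.
reps = [wishart_sample(n1, Sigma, rng) + wishart_sample(n2, Sigma, rng)
        for _ in range(2000)]
print(np.mean(reps, axis=0) / (n1 + n2))   # should be close to Sigma
```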

Hotelling Distribution

This is a generalization of the traditional Student’s \(t\)-distribution:

\[\left(\frac{\sqrt{n}(\bar{X} - \mu)}{S_n}\right)^2 \sim t^2_{n-1}\]

Definition
If \(\boldsymbol{X} \sim N(\boldsymbol{0}_p, \Sigma_p)\), \(W \sim W(\Sigma_p, n)\), and \(\boldsymbol{X}\) and \(W\) are independent, then: \(T^2 = n \boldsymbol{X}^\top W^{-1} \boldsymbol{X} \sim T^2(p, n)\)

Properties:

  • If \(T^2 \sim T^2(p, n)\), then: \(\frac{n - p + 1}{np} T^2 \sim F(p, n - p + 1)\)

    This proof is interesting. Let \(\boldsymbol{Z} = \Sigma^{-1/2} \boldsymbol{X} \sim N(\boldsymbol{0}_p, I_p)\) and \(\boldsymbol{Z}_i = \Sigma^{-1/2} \boldsymbol{X}_i\), where \(W = \sum_{i=1}^n \boldsymbol{X}_i \boldsymbol{X}_i^\top\). Then:

    \[\frac{n - p + 1}{np} T^2 = \frac{n - p + 1}{p} \boldsymbol{Z}^\top \left(\sum_{i=1}^n \boldsymbol{Z}_i \boldsymbol{Z}_i^\top\right)^{-1} \boldsymbol{Z}\]

    Furthermore, if we choose an orthogonal matrix \(\boldsymbol{U}\) (depending on \(\boldsymbol{Z}\) but independent of \(\boldsymbol{Z}_1, \dots, \boldsymbol{Z}_n\)) such that:

    \[\boldsymbol{U} \boldsymbol{Z} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \|\boldsymbol{z}\|_2 \end{bmatrix}\]

    then:

    \[\begin{aligned} \frac{n - p + 1}{np} T^2 &\overset{d}{=} \frac{n - p + 1}{p} \begin{bmatrix} 0, 0, \dots, \|\boldsymbol{Z}\|_2 \end{bmatrix} \left(\boldsymbol{U} \sum_{i=1}^n \boldsymbol{Z}_i \boldsymbol{Z}_i^\top \boldsymbol{U}^\top\right)^{-1} \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \|\boldsymbol{Z}\|_2 \end{bmatrix} \\ &= \frac{\boldsymbol{Z}^\top \boldsymbol{Z} / p}{W_{pp} / (n - p + 1)} \sim F(p, n - p + 1) \end{aligned}\]

    where \(\boldsymbol{Z}^{\top} \boldsymbol{Z} \sim \chi^2(p)\), \(W_{pp} = 1 \Big/ \left[\left(\boldsymbol{U} \sum_{i=1}^n \boldsymbol{Z}_i \boldsymbol{Z}_i^{\top} \boldsymbol{U}^{\top} \right)^{-1}\right]_{pp} \sim \chi^2(n - p + 1)\), and the two are independent.
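
A quick Monte Carlo check of this property (a sketch with arbitrarily chosen \(p\), \(n\), and \(\Sigma\)): simulate \(T^2 = n\,\boldsymbol{X}^\top W^{-1} \boldsymbol{X}\) directly from the definition and compare the rescaled values with \(F(p, n - p + 1)\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p, n, reps = 3, 20, 20000
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])

t2 = np.empty(reps)
for r in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma)            # X ~ N(0, Sigma)
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)    # n independent rows
    W = Z.T @ Z                                                # W ~ W(Sigma, n)
    t2[r] = n * X @ np.linalg.solve(W, X)

scaled = (n - p + 1) / (n * p) * t2
# Kolmogorov-Smirnov comparison with F(p, n - p + 1); the p-value should not be tiny.
print(stats.kstest(scaled, stats.f(p, n - p + 1).cdf))
```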

Wilks Distribution

This is a generalization of the \(F\)-distribution:

\[F = \frac{\chi^2(m) / m}{\chi^2(n) / n}\]

Definition
If \(W_1 \sim W(\Sigma_p, n_1)\), \(W_2 \sim W(\Sigma_p, n_2)\), and \(W_1\) and \(W_2\) are independent, then:

\[\Lambda = \frac{|W_1|}{|W_1 + W_2|} \sim \Lambda(p, n_1, n_2)\]


Properties:

  • If \(\Lambda \sim \Lambda(p, n_1, n_2)\) and \(n_1 \rightarrow \infty\), then: \(-r \ln \Lambda \rightarrow \chi^2(p n_2)\), where \(r = n_1 - \frac{1}{2}(p - n_2 + 1)\) (Bartlett's approximation).
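
The following sketch (parameters chosen arbitrarily) simulates \(\Lambda = |W_1| / |W_1 + W_2|\) from two independent Wishart draws and compares \(-r \ln \Lambda\) with the \(\chi^2(p n_2)\) approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, n1, n2, reps = 3, 40, 4, 20000
Sigma = np.eye(p)   # the distribution of Lambda does not depend on Sigma

def wishart(n):
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return Z.T @ Z

lam = np.empty(reps)
for i in range(reps):
    W1, W2 = wishart(n1), wishart(n2)
    lam[i] = np.linalg.det(W1) / np.linalg.det(W1 + W2)

r = n1 - 0.5 * (p - n2 + 1)
# Compare -r * log(Lambda) with chi^2(p * n2); the p-value should not be tiny.
print(stats.kstest(-r * np.log(lam), stats.chi2(p * n2).cdf))
```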

Fisher Lemma

Consider a multivariate random variable \(\boldsymbol{X} \sim N(\boldsymbol{\mu}_p, \Sigma_p)\) with observations \(\mathrm{X} = \{\boldsymbol{X}_1^\top, \dots, \boldsymbol{X}_n^\top\}^\top \in \mathbb{R}^{n \times p}\). Construct:

\[\begin{aligned} \bar{\boldsymbol{X}} &= \frac{1}{n} \sum_{i=1}^n \boldsymbol{X}_i = \frac{1}{n} \mathrm{X}^\top \boldsymbol{1}_n, \\ \boldsymbol{S} &= \frac{1}{n - 1} \sum_{i=1}^n (\boldsymbol{X}_i - \bar{\boldsymbol{X}})(\boldsymbol{X}_i - \bar{\boldsymbol{X}})^\top \\ &= \frac{1}{n - 1} \mathrm{X}^\top \left(I - \frac{1}{n} \boldsymbol{1}_n \boldsymbol{1}_n^\top\right) \mathrm{X} \in \mathbb{R}^{p \times p}. \end{aligned}\]
  • \(\bar{\boldsymbol{X}} \sim N(\boldsymbol{\mu}_p, \frac{1}{n} \Sigma_p)\)
  • \((n - 1)\boldsymbol{S} \sim W(\Sigma_p, n - 1)\)
  • \(\bar{\boldsymbol{X}}\) and \(\boldsymbol{S}\) are independent.

These properties are the multivariate version of Fisher's lemma and underpin the tests that follow.
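
A short numpy sketch (illustrative data only) that computes \(\bar{\boldsymbol{X}}\) and \(\boldsymbol{S}\) via the matrix identities above and checks them against numpy's built-in estimators:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.multivariate_normal([1.0, -2.0, 0.5], np.eye(p), size=n)   # rows are X_i^T

one = np.ones((n, 1))
xbar = (X.T @ one / n).ravel()                  # (1/n) X^T 1_n
C = np.eye(n) - one @ one.T / n                 # centering matrix I - (1/n) 1 1^T
S = X.T @ C @ X / (n - 1)                       # sample covariance matrix

print(np.allclose(xbar, X.mean(axis=0)))        # True
print(np.allclose(S, np.cov(X, rowvar=False)))  # True
```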

One-Sample Test for a Multivariate Mean (Hotelling's \(T^2\))

For example, consider a dataset of children’s growth:

Height (cm)   Weight (kg)   Chest (cm)
135.6         30.5          67.5
140.5         31.5          69.5
138.2         25.6          67.8
135.9         32.5          62.5

We can construct the Hotelling statistic \(T^2\) to test:

\[H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0\]

where:

\[T^2 = n (\bar{\boldsymbol{X}} - \boldsymbol{\mu}_0)^\top \boldsymbol{S}^{-1} (\bar{\boldsymbol{X}} - \boldsymbol{\mu}_0) \sim T^2(p, n - 1)\]
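
As a worked sketch, the code below applies this test to the four rows of the growth table; the hypothesized mean \(\boldsymbol{\mu}_0\) is made up for illustration, and with only \(n = 4\) observations the degrees of freedom are tiny. The conversion to \(F\) uses the Hotelling property above with \(n - 1\) degrees of freedom, i.e. \(\frac{n - p}{(n - 1)p} T^2 \sim F(p, n - p)\) under \(H_0\):

```python
import numpy as np
from scipy import stats

X = np.array([[135.6, 30.5, 67.5],
              [140.5, 31.5, 69.5],
              [138.2, 25.6, 67.8],
              [135.9, 32.5, 62.5]])
n, p = X.shape
mu0 = np.array([136.0, 30.0, 66.0])     # hypothetical H0 mean, chosen only for illustration

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)             # sample covariance; (n-1)S ~ W(Sigma, n-1)
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)

# Convert to F: (n-p)/((n-1)p) * T^2 ~ F(p, n-p) under H0.
F = (n - p) / ((n - 1) * p) * T2
print(T2, F, stats.f.sf(F, p, n - p))
```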

One-Way Multivariate Analysis of Variance (MANOVA)

Suppose we have multivariate data from \(k\) groups collected at different locations. We want to test:

\[H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \dots = \boldsymbol{\mu}_k\]

The data is structured as:

\[X^{(i)} = \begin{bmatrix} x_{11}^{(i)} & \cdots & x_{1p}^{(i)} \\ \vdots & & \vdots \\ x_{n_i 1}^{(i)} & \cdots & x_{n_i p}^{(i)} \end{bmatrix} = \begin{bmatrix} X_{(1)}^{(i)\top} \\ \vdots \\ X_{(n_i)}^{(i)\top} \end{bmatrix} \quad (i = 1, \dots, k).\]

Here, \(X^{(i)}\) represents observations from the \(i\)-th group. We define:

\[\begin{aligned} \text{SST} &= \sum_{i=1}^k \sum_{j=1}^{n_i} (X_{(j)}^{(i)} - \bar{X})(X_{(j)}^{(i)} - \bar{X})^\top, \\ \text{SSE} &= \sum_{i=1}^k \sum_{j=1}^{n_i} (X_{(j)}^{(i)} - \bar{X}^{(i)})(X_{(j)}^{(i)} - \bar{X}^{(i)})^\top, \\ \text{SSA} &= \sum_{i=1}^k n_i (\bar{X}^{(i)} - \bar{X})(\bar{X}^{(i)} - \bar{X})^\top. \end{aligned}\]

Here, SST is the total scatter matrix, SSE the within-group scatter, and SSA the between-group scatter, and \(\text{SST} = \text{SSE} + \text{SSA}\). Writing \(n = \sum_{i=1}^k n_i\), under \(H_0\) we have:

\[\begin{aligned} \text{SST} &\sim W(\Sigma_p, n - 1), \\ \text{SSE} &\sim W(\Sigma_p, n - k), \\ \text{SSA} &\sim W(\Sigma_p, k - 1). \end{aligned}\]

Taking \(W_1 = \text{SSE}\) and \(W_2 = \text{SSA}\) in the Wilks definition above gives the test statistic:

\[\Lambda = \frac{|\text{SSE}|}{|\text{SSE} + \text{SSA}|} = \frac{|\text{SSE}|}{|\text{SST}|} \sim \Lambda(p, n - k, k - 1),\]

and we reject \(H_0\) when \(\Lambda < c\), i.e., when the between-group scatter is large relative to the within-group scatter.
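
A sketch of the full computation on simulated groups (the group sizes and data are invented for illustration), including Bartlett's chi-square approximation from the Wilks property above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p, k = 2, 3
sizes = (12, 15, 10)
groups = [rng.multivariate_normal([0.0, 0.0], np.eye(p), size=m) for m in sizes]

n = sum(sizes)
grand_mean = np.vstack(groups).mean(axis=0)

SSE = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
SSA = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean, g.mean(axis=0) - grand_mean)
          for g in groups)
SST = SSE + SSA

Lam = np.linalg.det(SSE) / np.linalg.det(SST)     # Wilks Lambda ~ Lambda(p, n-k, k-1)

# Bartlett's chi-square approximation with r = (n-k) - (p - (k-1) + 1)/2.
r = (n - k) - 0.5 * (p - (k - 1) + 1)
chi2_stat = -r * np.log(Lam)
print(Lam, stats.chi2.sf(chi2_stat, p * (k - 1)))
```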

ANCOVA (Analysis of Covariance)

Suppose we want to study whether a grouping factor influences the response \(Y\) in the linear regression model:

\[Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon.\]

If we directly use univariate ANOVA to test for group differences among \(\boldsymbol{Y}^{(1)}, \dots, \boldsymbol{Y}^{(k)}\), we must first control for the influence of the covariates \(X_1, \dots, X_p\). For example, if \(\boldsymbol{X}_j^{(1)}, \dots, \boldsymbol{X}_j^{(k)}\) exhibit significant between-group differences, we adjust each observation by removing the part of \(Y\) explained by that covariate:

\[Y'_i = Y_i - \hat{\beta}_j (X_{ij} - \bar{X}_j),\]

and then test \(\boldsymbol{Y}'^{(1)}, \dots, \boldsymbol{Y}'^{(k)}\) with ordinary ANOVA.

Assumptions

  1. A linear relationship exists between \(X_j\) and \(Y\).
  2. The regression slopes \(\beta_j\) are the same across groups (homogeneity of slopes).
  3. No collinearity exists between \(X_i\) and \(X_j\).
  4. No interaction exists between the grouping factor and \(X_j\).

Example:

           Explosion ≥ 10 Years              Explosion < 10 Years
\(x_1\)    \(x_2\)    \(y\)        \(x_1\)    \(x_2\)    \(y\)
55         2          2.65         55         2          4.05
43         2          4.02         58         1          3.45
40         2          4.15         48         1          3.59
41         2          4.11         49         2          4.75
48         2          3.88         56         2          3.99
47         2          3.78         57         2          3.78
49         1          3.25         59         1          3.35
51         2          3.61         41         2          5.12
52         2          3.55         42         2          5.22
53         2          3.45         43         2          5.13
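
Below is a rough sketch of the adjustment procedure described above, using the table values as transcribed: estimate a pooled within-group slope for the covariates, subtract the covariate effect from \(y\), and compare the adjusted groups with a one-way ANOVA. A full ANCOVA would also correct the degrees of freedom for the estimated slopes.

```python
import numpy as np
from scipy import stats

# columns: x1, x2, y (values transcribed from the table above)
ge10 = np.array([[55, 2, 2.65], [43, 2, 4.02], [40, 2, 4.15], [41, 2, 4.11], [48, 2, 3.88],
                 [47, 2, 3.78], [49, 1, 3.25], [51, 2, 3.61], [52, 2, 3.55], [53, 2, 3.45]])
lt10 = np.array([[55, 2, 4.05], [58, 1, 3.45], [48, 1, 3.59], [49, 2, 4.75], [56, 2, 3.99],
                 [57, 2, 3.78], [59, 1, 3.35], [41, 2, 5.12], [42, 2, 5.22], [43, 2, 5.13]])
groups = [ge10, lt10]

# Pooled within-group regression of y on (x1, x2): common slope estimate beta_hat.
Xc = np.vstack([g[:, :2] - g[:, :2].mean(axis=0) for g in groups])   # centered covariates
yc = np.concatenate([g[:, 2] - g[:, 2].mean() for g in groups])      # centered response
beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# Adjust each observation, y' = y - beta_hat^T (x - grand mean of x), then compare groups.
xbar = np.vstack(groups)[:, :2].mean(axis=0)
adjusted = [g[:, 2] - (g[:, :2] - xbar) @ beta_hat for g in groups]
print(stats.f_oneway(*adjusted))
```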